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EFFICIENT  PARALLEL  COMPUTATION  OF  ILU(K)  PRECONDITIONERS 

DAVID  HYSOM  AND  ALEX  POTHEN* 


Abstract.  We  report  the  development  of  a  parallel  algorithm  for  computing  ILU  preconditioners.  The 
algorithm  attains  a  high  degree  of  parallelism  through  employment  of  a  two-level  ordering  strategy,  coupled 
with  a  subdomain  graph  constraint  that  regulates  the  location  of  nonzeros  in  the  Schur  complement.  Ex¬ 
perimental  results  include  timings  on  four  parallel  platforms,  for  problems  with  up  to  20  million  unknowns 
running  on  up  to  216  processors.  The  results  support  our  theoretic  analysis  that  the  algorithm  is  highly 
scalable,  for  both  preconditioner  computation  (factorization)  and  application  (triangular  solve)  stages. 

Key  words,  incomplete  factorization,  preconditioning,  parallel  ILU  preconditioning 

Subject  classification.  Computer  Science 

1.  Introduction  and  Background.  Solving  large,  sparse  systems  of  linear  equations  of  the  form 
Ax  —  b  is  a  key  component  in  many  scientific  and  engineering  numerical  computations.  Such  systems 
typically  arise  during  the  discretization  or  linearization  of  systems  of  partial  differential  equations  (PDFs) 
in  the  spatial  and  time  domains.  Since  the  LU  factors  tend  to  be  dense  even  when  A  is  sparse,  iterative 
solution  methods  are  increasingly  the  solution  methods  of  choice. 

It  is  frequently  desirable  to  improve  a  system’s  convergence  properties  by  preconditioning.  A  precon¬ 
ditioner  M  transforms  the  linear  system  into  a  related  system  possessing  better  convergence  properties, 
M~xAx  —  M~lb.  Parallel  algorithms  have  recently  been  developed  for  computing  approximate  inverse  pre¬ 
conditioners,  but  most  results  show  them  to  be  less  effective,  in  terms  of  iterations  required  for  convergence, 
than  are  serial  preconditioners  based  on  incomplete  factorizations  (ILU)  of  A .  However,  while  ILU  precon¬ 
ditioners  are  popular,  effective,  and  robust,  they  are,  according  to  the  conventional  wisdom,  primarily  serial 
in  nature. 

We  have  developed  and  implemented  coarse-grained  algorithms  that  perform  well  for  2D  and  3D  prob¬ 
lems.  High  performance  is  achieved  through  use  of  local  and  global  orderings  of  partitioned  matrices,  coupled 
with  the  constraint  that  subdomain  graphs,  which  capture  communication  patterns,  be  identical  for  A  and 
the  LU  factors.  Our  algorithmic  framework  supports  threshold-based  (ILUT)  and  level  based  (ILU(k)) 
factorization  methods,  with  optional  partial  pivoting  (ILUTP,  ILUP(K)). 

ILU  preconditioning  is  based  on  the  computation  of  triangular  factors  L  and  U,  where  LU  «  A.  Pre¬ 
conditioner  application  reduces  to  the  solution  of  two  triangular  systems,  Ly  =  b  and  Ux  =  y,  during  each 
iteration.  The  factors  are  commonly  grouped  together  as  F  =  L  +  U  —  I.  Several  algorithms  for  computing 
the  triangular  factors,  L  and  U,  are  known  and  widely  used  in  serial  contexts. 

ILUT  algorithms  perform  row-by-row,  upward-looking  factorizations,  discarding  elements  that  are  smaller 
than  a  given  value.  Perhaps  the  most  popular  formulation  is  ILUT(r,p)  [13],  which  employs  a  dual-dropping 
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strategy.  The  first  parameter  r  is  the  dropping  threshold,  while  the  second  parameter,  p,  limits  the  amount 
of  fill  in  any  row  of  the  matrix.  If  a  row  in  the  original  matrix  A  has  r  nonzero  entries,  then  a  maximum 
of  r  +  p  entries  are  permitted  in  the  corresponding  row  of  F.  The  limit  p  is  sometimes  applied  separately 
to  upper  and  lower  triangular  portions  of  the  row.  As  with  other  ILU  algorithms,  both  symmetric  and 
non-symmetric  variants  exist. 

In  structurally  based  ILU(k)  algorithms,  factorization  is  separated  into  symbolic  and  numeric  phases. 
The  permitted  locations  of  nonzero  entries  in  the  factor  are  determined  during  the  symbolic  phase,  based  on 
the  level,  k.  All  entries  in  the  original  matrix  are  assigned  a  level  of  0  and  are  permitted  in  the  factor.  During 
factorization,  whenever  two  matrix  entries  cause  a  new  entry,  the  entry  is  assigned  a  level  based  on  the  levels 
of  the  causative  entries.  If  this  level  is  less  than  or  equal  to  k  the  entry  is  permitted  in  the  factor.  Since  a 
new  entry  may  be  caused  by  different  pairs  of  existing  fill  elements,  the  new  entry  retains  the  minimum  value 
of  all  computed  levels.  ILU(k)  techniques  were  known  as  early  as  the  1960s,  and  were  originally  developed  in 
the  context  of  solving  finite  difference  equations  for  elliptic  PDEs.  The  review  article  [3]  provides  historical 
references. 

Two  different  functions  have  been  used  for  assigning  levels  to  newly  created  fill  entries.  Following  most 
present  day  implementations,  we  use  the  sum  definition,  which  assigns  the  new  entry  the  sum  of  the  levels 
of  the  two  causative  entries,  incremented  by  one.  For  this  definition,  the  length  of  a  fill  path  corresponds  to 
the  number  of  times  an  entry  is  divided  by  a  pivot  value  during  numerical  factorization. 

The  “classical”  algorithm  for  computing  ILU(k)  structures  operates  by  mimicking  upward-looking,  row- 
oriented  factorization.  This  algorithm  is  commonly  interpreted  as  a  unioning  of  the  rows  of  F,  although  it 
also  has  graph  theoretic  interpretations. 

The  upper  triangle  of  any  matrix  F  can  be  associated  with  a  directed  graph,  G(Fu)  =  (V,  £7),  which  has 
has  vertex  set  V  and  edge  set  E.  A  directed  edge  {v,w}  €  E  if  and  only  if  there  is  a  nonzero  matrix  entry, 
fViW  in  the  upper  triangle  of  F.  We  assume  the  reader  is  familiar  with  the  concept  of  paths  in  graphs.  We 
call  G(Fjj)  the  directed  adjacency  graph  of  the  upper  triangle  of  F.  Similar  graphs,  G(F)  and  G(Fl ),  can 
be  constructed  for  the  complete  matrix  or  the  matrix’s  lower  triangle. 

The  classical  algorithm,  then,  can  be  interpreted  as  factoring  row  j  by  conducting  a  graph  search  in 
G(Fu)  starting  from  all  nodes  i,  where  the  problem  matrix  A  contains  a  nonzero  entry  dij. 

A  newer  Graph  Search  Algorithm,  which  we  have  described  elsewhere  [11],  computes  structures  identical 
to  those  of  the  classic  algorithm,  but  operates  by  conducting  breadth-first  searches  in  the  underlying  graph 
of  G(Aji)  (J  G(Au).  This  algorithm  arises  from  a  theorem,  cited  below,  which  is  an  extension  of  the  fill  path 
theorem  [12],  originally  developed  in  the  context  of  complete  factorizations.  The  fill  path  theorem  states 
that  fill  edges  can  only  exist  between  vertices  joined  by  a  fill  path. 

Definition  1.1.  A  fill  path  is  a  path  joining  two  vertices  v  and  w ,  all  of  whose  interior  vertices  are 
numbered  lower  than  the  minimum  of  the  numbers  of  v  and  w. 

THEOREM  1.2.  The  edge  {u,w}  is  a  fill  edge  of  level  k  if  and  only  if  there  exists  a  shortest  fill  path  in 
G(A)  joining  v  and  w  whose  length  is  k  4- 1. 

D’Azevedo,  Forsyth,  and  Tang  defined  fill  levels  in  terms  of  shortest  fill  path  lengths  in  [6],  although 
they  used  the  language  of  reachability  sets  rather  than  fill  paths.  Hence  they  knew  the  result  in  Theorem  1, 
but  did  not  state  or  prove  the  theorem. 

Theorem  1.2  has  an  intuitively  simple  geometric  interpretation.  Construct  a  sphere  of  radius  k  +  l  about 
any  node  i  in  a  graph.  Then  a  fill  entry  fij  can  only  be  permitted  in  an  ILU(k)  factor  if  j  is  within  the 
sphere.  This  interpretation  will  be  used  in  the  next  section  to  explain  our  constraint  on  the  processor  graph. 
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Many  other  ILU  algorithms  are  known;  these  include  combinations  of  partial  pivoting  with  ILU(k) 
and  ILUT,  preservation  of  row-sums  in  the  factor,  drop-tolerance  versions  of  ILU(k),  Fast  Graph  Search 
techniques  [11],  etc. 

Although  ILU(k)  algorithms  are  widely  known  and  implemented  [1,  4,  14],  virtually  all  studies  we  have 
come  across  in  the  literature  have  been  limited  to  ILU(O)  or  ILU(l).  This  is  somewhat  unfortunate  since  our 
own  work,  as  reported  in  Section  3,  has  shown  high-level  ILU(k)  preconditioning  to  be  highly  effective  for 
many  problems.  We  define  a  high-level  ILU(k)  preconditioner  as  one  whose  structure  results  from  an  ILU(k) 
factorization,  with  k  >  1.  As  a  rule  of  thumb,  for  problems  with  several  hundred  thousand  unknowns,  we 
are  interested  in  levels  as  high  as  ten  or  twelve. 

The  structural  and  numerical  relationships  between  ILU(k)  and  ILUT  factors,  particularly  for  higher- 
level  ILU(k),  have  not  been  well  studied,  although  for  some  simple  cases  (certain  symmetric,  diagonally- 
dominant  matrices)  the  algorithms  can  be  shown  to  produce  identical  factors.  Published  works  have  tended 
to  report  iterations  and  solution  timings,  but  element-by-element  comparisons  between  the  factors  themselves 
have  apparently  not  been  undertaken. 

The  relationships  between  matrix  orderings  and  ILU  preconditioning  performance  is  a  complex  issue 
which  has,  and  continues  to  be,  studied  by  numerous  researchers  [2,  5,  7,  8,  9].  At  least  four  interrelated 
effects  can  be  identified  in  parallel  contexts.  First,  matrix  ordering  can  be  used  to  provide  parallelism,  i.e, 
as  a  partitioning  method.  Second,  from  a  structural  viewpoint,  altering  a  matrix’s  ordering  and  ILU(k)  or 
ILUT(r,_p)  parameters  changes  the  amount  and/or  pattern  of  fill;  this  affects  the  amount  of  work  required 
during  setup  and  application  phases.  Since  altering  the  fill  count  and/or  pattern  also  changes  a  precon¬ 
ditioner’s  numerical  properties,  the  third  and  fourth  effects  are  alterations  in  convergence  properties  and 
numerical  stability. 

Since  our  algorithmic  approach  intrinsically  changes  a  problem’s  ordering,  at  various  places  in  the  fol¬ 
lowing  sections  we  comment  on  how  this  effects  parallelism,  total  work  (solution  time),  and  convergence 
properties.  We  do  not  address  the  stability  issue  in  this  work. 

2.  Parallel  Algorithmic  Framework.  Our  approach  is  based  on  three  constraints,  two  of  which  are 
merely  reflective  of  real-world  problems  and  hardware.  First,  we  stipulate  that  the  number  of  processors  be 
small  in  relation  to  the  problem  size:  p  <£  N.  This  is  typically  the  case  for  existing  and  planned  hardware 
platforms  (e.g,  ASCI  Teraflop  machines)  where  we  may  have  thousands  of  processors  but  need  to  solve 
equations  with  millions  of  unknowns. 

Second,  we  assume  the  matrix  A  is  well-partitionable.  This  is  analogous  to  physical  domain  decompo¬ 
sition  such  that  subdomains  have  large  volume  to  surface  ratios.  To  generalize  this  notion,  we  call  node  i 
(and  its  associated  matrix  row)  interior  if,  for  every  nonzero  entry  a ij  or  a?-*,  nodes  i  and  j  belong  to  the 
same  subdomain.  Conversely,  i  is  a  boundary  node  if  there  is  at  least  one  edge  a^-  or  aji  such  that  nodes 
i  and  j  belong  to  different  subdomains.  In  general,  then,  a  matrix  is  well-partitioned  if  all  subdomains 
have  large  interior  to  boundary  node  ratios.  Spatially  discretized  partial  differential  equations  are  frequently 
well-partitionable. 

Prior  to  stating  our  final  constraint  we  define  a  subdomain  graph  as  a  graph  whose  nodes  correspond  to 
subgraphs  (subdomains)  and  whose  edgeset  contains  edge  (r,  5)  if  subgraphs  r  and  s  are  joined  by  one  or 
more  edges. 

Our  third  constraint,  then,  is  that  the  subdomain  graph  of  the  factor  F  and  the  matrix  A  be  identical. 
This  constraint  permits  the  determination  of  all  communication  patterns  prior  to  actual  factorization  phases, 
as  discussed  below. 
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Within  each  subdomain  we  require  that  interior  nodes  be  ordered  before  boundary  nodes.  By  the  fill  path 
theorem  [12],  this  ensures  that,  if  an  edge  fij  arises  during  factorization  and  crosses  subdomain  boundaries, 
then  nodes  i  and  j  must  have  been  boundary  nodes  in  the  graph  of  A.  In  other  words,  ordering  interior 
nodes  first  in  the  graph  of  A  ensures  that  interior  nodes  can  not  be  converted  to  boundary  nodes  in  the 
graph  of  F.  This  permits  the  independent  factorization  of  interior  nodes  within  each  subdomain.  It  also 
preserves  parallelism  during  preconditioner  application,  since  subdomain  interior  nodes  can  be  evaluated 
independently  during  the  triangular  solves. 

The  independence  of  subdomain  interiors  also  allows  selection  of  ILU  algorithms  and  parameters  (e.g, 
level,  for  ILU(k))  on  a  subdomain  by  subdomain  basis.  Hence,  our  framework  is  ideally  suited  for  adaptive 
preconditioning  in  multi-physics  problems,  wherein  different  models  are  used  for  different  physical  subdo¬ 
mains. 

For  the  remainder  of  this  paper  we  assume  a  one-to-one  and  onto  mapping  between  subdomains  and 
processors.  Also,  due  to  the  intimate  connection  between  matrices  and  graphs  in  the  context  of  matrix 
factorization,  we  use  the  phrases  eliminating  a  node  and  factoring  a  row  synonymously. 

Figure  2.1  contains  a  high-level  overview  of  our  algorithm.  Comments  on  particulars  follow. 

Input:  A  well-partitioned,  distributed  matrix. 

1.  Form  subdomain  graph  and  order  vertices  to  reduce  di¬ 
rected  path  lengths  by  a  vertex  coloring. 

2.  On  each  processor,  locally  order  interior  nodes,  then  order 
boundary  nodes. 

3.  Factor  interior  rows.  Processors  having  no  lower-ordered 
neighbors  in  the  subdomain  graph  also  factor  boundary 
rows,  and  go  to  step  6. 

4.  Receive  row  structure  and  values  of  boundary  rows  from 
lower-ordered  neighbors  in  the  subdomain  graph. 

5.  Factor  boundary  rows. 

6.  Send  boundary  row  structures  and  values  to  higher-ordered 
neighbors  in  the  subdomain  graph. 

Fig.  2.1.  Parallel  Algorithmic  Framework 


The  first  step,  which  induces  a  global  matrix  reordering,  generally  requires  global  communication.  How¬ 
ever,  due  to  the  constraint  p  <^L  N,  this  computation  is  not  time-constraining.  If  special  topological  informa¬ 
tion  is  available,  e.g.,  a  regular  2D  or  3D  grid  has  been  used  for  discrectization,  this  step  can  be  accomplished 
without  communication. 

Steps  four  through  six  can  potentially  give  rise  to  sequential  bottlenecks.  If  processors  r  and  s  are 
neighbors  in  the  subdomain  graph  and  r  is  ordered  before  s,  then  processor  s  cannot  complete  factorization 
of  its  boundary  rows  until  it  receives  factored  boundary  row  structural  and  numerical  values  from  r .  Regular 
2D  and  3D  grids  can  easily  be  ordered  in  red  black  fashion  so  that  dependency  paths  for  the  subdomain 
graph  of  A  are  at  most  of  length  one.  Additional  dependencies,  however,  can  arise  during  factorization,  as 
illustrated  in  Figure  2.2. 

It  is  not  always  possible  to  compute  in  advance  where  these  dependencies  will  arise.  Further,  if  all 
dependencies  are  permitted,  numerous  synchronization  points  may  be  required  in  order  to  determine  which 
processors  must  send  what  to  whom.  Our  third  constraint,  that  subdomain  graphs  of  A  and  F  be  identical, 
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Fig.  2.2.  Left:  ordered  processor  graph  of  A;  all  dependency  paths  are  of  length  one.  Right:  some  of  the  additional 
dependencies  that  can  arise  during  factorization;  there  is  now  a  path  length  of  three,  1,  3,  4 >  0,  which  necessitates  three 
non- overlapping  communication  phases  for  transmittal  of  boundary  row  structure  and  values. 


Fig.  2.3.  2D  grid  after  global  and  local  reordering  phase.  Level  1  and  level  2  fill  edges  which  cross  subdomain  boundaries 
are  shown.  The  four  dotted  lines  between  processors  P-2  and  P-4  7s  subdomains  indicate  fill  edges  that  are  prohibited  due  to 
the  subdomain  graph  constraint.  All  level  0,  and  level  1  and  2  edges  not  crossing  subdomain  boundaries,  have  been  omitted  for 
clarity. 


dispenses  with  the  need  for  this  sort  of  synchronization. 

The  constraint,  however,  may  alter  the  factor’s  structural  and  numerical  properties  by  preventing  the 
inclusion  of  some  fill  edges.  Figure  2.3  is  intended  to  provide  an  intuitive  characterization  of  prohibited  fill 
edges.  For  a  typical  2D  grid,  we  see  that  prohibited  edges  can  arise  near  points  where  subdomains  touch 
corners  but  are  not  neighbors  in  the  subdomain  graph.  Theorem  1  tells  us  that  these  edges  can  only  arise 
within  a  radius  of  k  +  1  edges  about  such  points. 

If  desirable,  the  subdomain  graph  constraint  can  be  relaxed,  as  long  as  the  structure  of  the  subdomain 
graph  of  F  can  be  determined  before  factorization  begins.  Acceptable  subdomain  graphs  for  F>  for  example, 
can  be  computed  by  performing  ILU(k)  factorizations  on  the  subdomain  graph  of  A. 
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Several  other  variants  of  our  algorithm  are  possible.  For  example,  in  earlier  work  on  serial  preconditioning 
we  developed  algorithms  that  compute  symbolic  factors  based  solely  on  the  nonzero  structure  of  A  [11]. 
Incorporating  these  algorithms  into  our  framework  permits  complete  separation  of  symbolic  and  numeric 
stages  for  ILU(k)  factorizations.  This  may  be  especially  advantageous  since  we  have  found-with  the  exception 
of  ILU(0)-that  the  cost  for  symbolic  factorization  tends  to  be  equal,  or  slightly  greater  than,  that  for 
numeric  factorization.  These  algorithms  are  also  particularly  suitable  for  implementation  in  shared  memory 
environments. 

We  have  recently  become  aware  of  work  by  Karypis  and  Kumar  [10],  which  is  similar  to  ours  in  that  they 
employ  partitioning,  followed  by  elimination  of  interior  nodes,  followed  by  elimination  of  boundary  (in  their 
terminology,  interface)  nodes.  Their  experimental  results  are  encouraging,  since  they  support  our  thesis  that 
this  general  approach  is  effective,  highly  parallel,  and  scalable.  We  note  the  following  differences  between 
their  work  and  ours. 

Karypis  and  Kumar’s  work  centers  on  ILUT,  while  we  are  developing  object-oriented  code  which  supports 
tailored  selection  of  numerous  ILU  variants  on  a  sub  domain- by-subdomain  basis.  Additionally,  our  use  of 
the  fill  path  Theorem  [12],  along  with  Theorem  1,  its  counterpart  for  ILU(k),  enables  us  to  present  (see 
Section  4)  theoretical  performance  analyses.  Our  subdomain  graph  constraint  enables  an  easy  computation 
of  communication  patterns;  in  contrast,  Karypis  and  Kumar  compute  subdomain  dependencies  in  an  ad  hoc 
manner,  as  factorization  progresses. 

3.  Results.  Results  in  this  section  are  based  on  the  following  model  problems. 

Problem  1.  Poisson’s  equation  in  two  or  three  dimensions 

A  u  —  0. 

Problem  2.  Convection-diffusion  equation  with  convection  in  the  xy  plane 

-eAu  +  - ^-exyu  4-  ^-e~xyu  =  g. 
ox  oy 

Homogeneous  boundary  conditions  were  used  for  both  problems.  Derivative  terms  were  discretized  on  the 
unit  square  or  cube,  using  3-point  central  differencing  on  regularly  spaced  nx  x  ny  x  nz  grids  ( nz  =  1  for 
2D).  The  zero  vector  was  used  for  the  right-hand  side  of  the  resulting  systems,  Ax  =  6,  and  random  vectors 
with  values  in  the  range  [-1, 1]  were  used  for  initial  guesses.  The  coefficient  for  Problem  2  was  e  —  1/500, 
which  yields  a  moderately  unsymmetric  system  of  equations. 

Although  much  present  day  scientific  modeling  results  in  systems  of  Partial  Differential  Equations  (PDEs) 
far  more  complicated  than  our  model  problems,  most  of  our  parallel  code’s  behavior  can  be  understood  by 
examining  these  simple  problems  in  conjunction  with  ILU(k)  preconditioning.  ILU(k)  preconditioning  is 
amenable  to  performance  analysis  since  the  structures  of  ILU (k)  preconditioners  are  identical  for  any  PDE 
that  has  been  discretized  on  a  2D  or  3D  grid  with  a  given  stencil.  The  structure  depends  on  the  grid 
and  the  stencil  only,  and  is  not  affected  by  numerical  values.  Identical  structures  imply  identical  symbolic 
factorization  costs,  as  well  as  identical  flop  counts  during  the  numerical  factorization  and  solve  phases.  In 
parallel  contexts,  communication  patterns  and  costs  are  also  identical.  While  preconditioner  effectiveness- 
the  number  of  iterations  until  the  stopping  criteria  is  reached-differs  with  the  numerics  of  the  particular 
problem  being  modeled,  the  parallelism  available  in  the  preconditioner  does  not. 

The  structure  of  ILUT  preconditioners,  on  the  other  hand,  is  a  function  of  the  grid,  the  stencil,  and  the 
numerics.  Changing  the  problem,  particularly  for  non-diagonally  dominant  cases,  can  alter  the  preconditioner 
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structure,  even  when  the  grid  and  stencil  remain  the  same.  Moreover,  even  for  diagonally  dominant  problems, 
the  structure  of  ILUT(r,p)  preconditioners  is  a  function  of  two  parameters,  r  and  p,  which  makes  for  a  much 
more  complicated  experimental  space  than  when  structure  is  controlled  by  the  single  parameter,  k. 

These  are  the  primary  reasons  why,  even  though  our  algorithmic  framework  is  suitable  with  use  of 
ILUT(r,p)  and  other  ILU  variants,  we  find  performance  evaluation  more  meaningful  when  carried  out  in  the 
context  of  ILU(k).  In  addition,  most  ILU(k)  implementations  separate  symbolic  from  numeric  factorization, 
thus  permitting  an  easy  determination  of  the  time  spent  in  each  stage.  We  typically  see  that  the  symbolic 
phase-during  which  no  floating  point  operations  are  issued-takes  as  long  or  longer  than  the  numeric  phase. 
These  relationships  are  very  difficult  to  determine  for  ILUT,  since  the  symbolic  and  numeric  phases  must  be 
interleaved  on  a  row-by-row  basis. 

In  addition  to  demonstrating  that  our  algorithm  can  provide  high  degrees  of  parallelism,  several  other 
issues  must  be  addressed.  Our  results  show  that  available  parallelism  increases  with  level,  so  we  need 
to  examine  the  effectiveness  of  high-level  preconditioners  for  reducing  total  solution  time.  Since  memory 
requirements  are  directly  related  to  fill  count,  we  should  consider  whether  the  storage  costs  imposed  by 
high-level  preconditioners  are  acceptable.  As  there  is  not  much  point  in  parallelizing  operations  which  only 
account  for  very  small  proportions  of  execution  time,  we  need  to  examine  how  much  run  time  is  typically 
consumed  by  preconditioner  computation.  Finally,  since  matrix  orderings  can  significantly  effect  execution 
time  for  both  ILU(k)  and  ILUT,  we  should  look  at  how  the  matrix  orderings  required  by  our  approach  affect 
convergence  behavior. 

3.1.  Preconditioner  Effectiveness.  In  this  subsection  we  consider  the  effectiveness  of  high-level  pre¬ 
conditioners;  look  at  the  relationships  between  factorization  time,  total  execution  time,  and  level;  examine 
storage  costs;  and  consider  how  the  orderings  imposed  by  our  algorithms  effect  convergence.  These  topics 
are  fairly  easy  to  address  since  they  can  be  adequately  examined  in  sequential  environments.  The  results 
are  important,  however,  because  they  lay  the  groundwork  for  interpreting  our  code’s  performance  in  parallel 
environments,  presented  in  the  next  subsection. 

Experiments  in  this  subsection  were  conducted  on  a  Sun  Ultra- 30,  with  1024  Megabytes  of  main  memory 
and  a  CPU  clock  rate  of  300  MHz.  Problem  1  was  solved  using  the  Conjugate  Gradient  method  and  left 
preconditioning.  Problem  2  was  solved  using  GMRES  with  restart=30,  Unmodified  Gram-Schmidt,  and 
1  step  iterative  refinement  orthogonalization.  The  solvers  were  based  on  optimized  code  from  the  PETSc 
library,  with  our  own  routines  used  to  measure  CPU  time.  Stopping  criteria  was  ||  r*  ||2<  rtol  with  rtol, 
the  relative  decrease  in  the  residual  norm,  ranging  from  10“5  to  10-7. 

Since  Problem  1  is  symmetric,  it  could  be  preconditioned  using  Incomplete  Cholesky  Factorization 
(IC(k))  rather  than  ILU(k).  However,  if  the  problem  were  augmented  by  the  addition  of  a  first  derivative 
term  the  resulting  system  would  unsymmetric  and  we  would  be  forced  to  use  ILU(k),  even  if  the  first-order 
term  had  near-negligible  coefficients.  We  therefore  believe  that  reporting  on  ILU(k)  rather  than  IC(k)  makes 
the  results  more  general. 

Total  solution  times  are  reported  in  preference  to  iteration  counts  since  work-per-iteration  is  largely 
dependent  on  the  total  amount  of  fill  permitted  in  the  preconditioner.  Also,  evaluating  preconditioner 
effectiveness  by  iteration  count  comparison  ignores  preconditioner  setup  (factorization)  time,  which  can 
constitute  a  considerable  proportion  of  total  execution  time. 

We  begin  by  reviewing  some  statistics  concerning  ILU(k)  behavior  in  general.  Table  3.1  summarizes 
ILU(k)  behavior  for  Problem  1.  The  table  shows  total  solution  time,  the  percentage  of  total  solution  time 
devoted  to  preconditioner  factorization,  and  the  ratio  of  nonzeros  in  the  preconditioner  to  that  in  the  problem 
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Table  3.1 

Convergence  behavior  for  Laplace’s  equation  (Problem  1)  on  2D,  360  x  360  grid. 


level 

nzF/nzA 

rtol 

solution  time 

=  10~5 

setup  ratio  (%) 

rtol 

solution  time 

=  10~7 

setup  ratio  (%) 

0 

1.0 

10.1 

5 

33.2 

1.6 

1 

1.4 

7.8 

12 

24.6 

3.6 

2 

1.8 

7.1 

15 

21.6 

4.9 

3 

2.6 

6,4 

25 

19.1 

8.2 

4 

3.4 

6.3 

32 

17.6 

11.6 

5 

4.2 

6.9 

39 

16.8 

16.4 

6 

4.9 

6.5 

50 

16.4 

20.0 

7 

5.7 

7.4 

54 

15.7 

25.0 

8 

6.5 

8.0 

62 

16.4 

30.0 

9 

7.3 

8.8 

65 

16.6 

34.8 

Table  3.2 

Convergence  behavior  for  Convection- Diffusion  equation  (Problem  2)  on  2D,  360  x  360  grid. 


level 

nzF/nzA 

rtol 

solution  time 

=  10"5 

setup  ratio  (%) 

rtol 

solution  time 

=  10-7 

setup  ratio  (%) 

0 

1.0 

39.2 

1.3 

165.8 

0.3 

1 

1.4 

34.3 

2.2 

108.2 

0.7 

2 

1.8 

33.5 

2.7 

99.5 

0.9 

3 

2.6 

31.4 

4.1 

82.1 

1.5 

4 

3.4 

30.2 

5.8 

77.6 

2.2 

5 

4.2 

28.0 

8.7 

69.4 

3.5 

6 

4.9 

26.4 

11.2 

56.9 

5.2 

7 

5.7 

25.4 

13.8 

54.3 

6.6 

8 

6.5 

23.5 

18.8 

54.0 

8.0 

9 

7.3 

23.6 

21.8 

48.4 

10.5 

10 

8.1 

25.8 

25.6 

50.0 

13.2 

(nzF/nzA). 

There  are  three  points  to  be  noted  in  this  data.  First,  the  most  time-efficient  solutions  require  relatively 
high  fill  levels.  Second,  as  greater  resolution  is  required  (smaller  rtol),  the  optimal  preconditioner  level 
increases.  Changing  the  stopping  criteria  by  a  factor  of  100,  from  rtol  =  10-5  to  rtol  —  10~7,  results  in  a 
change  from  ILU(4)  to  ILU(7),  for  fastest  solution.  Finally,  for  the  most  time-efficient  solutions,  between  1/3 
and  1/4  of  execution  time  was  spent  in  the  preconditioner  setup  phase  (symbolic  and  numeric  factorization). 
Application  of  Amdahl’s  Law  tells  us  that,  if  we  can  not  effectively  parallelize  ILU(k)  factorization,  we  can 
never  obtain  more  than  a  three-fold  speedup. 

Table  3.2  contains  similar  statistics  for  Problem  2.  The  same  trends  noted  for  Problem  1  are  evident 
here,  although  the  fastest  solutions  required  an  even  greater  fill  level. 

It  is  illuminating  to  note  how  the  amounts  of  fill  in  ILU(k)  preconditioners  compare  with  the  amounts 
commonly  permitted  in  ILUT  preconditioners.  In  [2],  Benzi,  et  al.,  compared  preconditioner  performance 
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Table  3.3 

Comparison  of  ILU(k)  and  ILUT  fill  counts,  2D  domain. 


grid  size 

ILUT(.01,5), 

ILUT(.001,10) 

ILU(l) 

ILU(2) 

ILU(3) 

ILU  (4) 

ILU  (5) 

128  x  128 

31 

154 

16 

31 

62 

93 

124 

256  x  256 

126 

628 

64 

128 

256 

383 

509 

300  x  300 

174 

865 

88 

177 

353 

528 

702 

400  x  400 

310 

1544 

158 

316 

630 

944 

1257 

Table  3.4 

Comparison  of  ILU(k)  and  ILUT  fill  counts,  3D  domain. 


grid  size 

ILUT(.01,5), 

ILUT(.001,10) 

ILU(l) 

ILU(2) 

ILU  (3) 

16  x  16  x  16 

14 

38 

9 

22 

47 

32  x  32  x  32 

119 

313 

83 

216 

475 

40  x  40  x  40 

236 

615 

167 

439 

971 

50  x  50  x  50 

466 

1206 

336 

885 

1965 

between  ILU(O),  ILU(l),  ILUT(.005,5),  and  ILUT(.001,10)  *,  for  a  variety  of  problems  and  matrix  orderings. 
We  noticed  that,  for  most  of  the  cases  reported,  the  amount  of  fill  permitted  in  the  ILUT  preconditioners 
exceeded  that  permitted  in  the  ILU(k)  factors.  This  motivated  us  to  wonder  what  ILU(k)  levels  would  be 
required  to  produce  amounts  of  fill  comparable  to  those  in  the  ILUT  factors.  Tables  3.3  and  3.4  show  this 
comparison.  The  figures  for  the  ILUT  levels  are  from  [2].  The  figures  for  ILU(k)  were  computed  using  the 
PETSc  library.  In  these  tables  we  use  Benzi’s  method  of  reporting  fill  counts,  which  is  the  amount  of  fill 
greater  than  level  zero,  in  either  the  upper  or  lower  triangle  of  the  factor. 

Even  though  the  entries  are  computed  differently,  every  entry  in  an  ILUT  factor  can  be  assigned  the 
level  it  would  have,  if  it  were  permitted  in  an  ILU(k)  factor.  Since  The  fill  counts  in  the  2D  ILUT(.001,10) 
factors  in  Table  3.3  exceed  the  fill  counts  for  the  ILU(5)  factors,  it  follows  that  the  ILUT  factors  must  include 
at  least  some  level  six,  or  larger,  entries. 

High-level  preconditioners  are  effective  for  many  problem  types.  The  Driven  Cavity  set  from  the 
SPARSKIT  Collection  consists  of  a  series  of  non-symmetric,  indefinite  matrices  which  arise  from  model¬ 
ing  of  the  incompressible  Navier-Stokes  equations.  They  are  reportedly  difficult  to  solve  iteratively  due 
to  the  high  condition  numbers  of  the  coefficient  matrices.  All  the  problem  can  be  solved,  however,  using 
high-level  preconditioners.  Table  3.5  shows  a  typical  problem,  for  which  the  fastest  solution  time  was  ob¬ 
tained  with  ILU(3)  preconditioning.  This  problem  can  also  be  solved  using  high-level  ILUT  preconditioning, 
although  we  found  that  solution  times  were  slower,  with  larger  amounts  of  fill  required  to  force  convergence. 

The  use  of  high-level  preconditioning  is  counter-intuitive  in  the  sense  that  a  primary  reason  for  employing 
iterative  solvers  is  to  reduce  storage  requirements.  There  are,  however,  large  gaps  between  the  amounts  of  fill 
we  have  been  talking  about  and  that  required  for  direct  factorization.  Figure  3.1  shows  storage  requirements 
as  a  function  of  level,  for  k  =  0, . . . ,  K ,  where  ILU^)  results  in  complete  factorization.  The  conclusion  is 
that  memory  requirements  for  high-level  preconditioning,  given  the  levels  we  have  been  discussing,  do  not 
begin  to  approach  that  needed  for  direct  solution  methods. 

While  the  curve  in  Figure  3.1  shows  fill  growth  typical  of  what  we  have  observed  for  all  medium  to 

1  More  accurately,  they  used  SILUT,  a  modification  of  the  ILUT  algorithm  which  gives  rise  to  symmetric  preconditioners 
when  the  original  matrix  is  symmetric. 
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Table  3.5 

Iterative  performance  for  Driven  Cavity  problem  e40r3000.  ILU(l)  converged  very  slowly,  and  was  terminated  before 
achieving  exit  criteria. 


level 

solution 

time  (s) 

PC  formation 
time  (%) 

residual 

norm 

iterations 

1 

95.7 

8 

8.0e-3 

NA 

2 

91.8 

14 

3.0e-5 

299 

3 

49.2 

49 

1.8e-5 

85 

4 

58.2 

58 

3.6e-5 

56 

5  : 

64.5 

64 

4.7e-5 

29 

6 

78.7 

85 

1.5e-5 

27 

3D  grid,  n  =  16,  N  =  4096 


level 


Fig.  3.1.  Storage  requirements  vs.  fill  level  for  16  x  16  grid 

large  matrices  studied,  a  word  of  caution  is  in  order.  Smaller  matrices,  such  as  most  in  the  Harwell  Boeing 
collection,  tend  to  have  relatively  small  maximum  levels.  For  such  problems,  employing  ILU(k)  for  k  >  1 
may  result  in  factors  whose  sizes  approach  those  for  complete  factorization. 

We  now  turn  to  the  question  of  how  our  algorithm’s  matrix  reorderings  affect  convergence.  Although  our 
algorithm  admits  a  high  degree  of  parallelism  (as  demonstrated  in  the  following  subsection),  the  parallelism 
will  not  be  of  much  utility  if  an  imposed  ordering  results  in  a  significant  degradation  of  convergence  behavior. 

There  are  several  ways  we  might  investigate  this  question.  For  example,  we  could  completely  solve  linear 
systems  in  parallel  environments,  using  our  preconditioner  code  in  conjunction  with  some  iterative  software 
library  such  as  PETSc.  This  is,  of  course,  the  ultimate  goal.  However,  measurements  from  such  a  procedure 
do  not  allow  us  to  isolate  the  effects  of  matrix  orderings  from  other  influences.  Instead,  the  measurements 
would  reflect  a  combination  of  influences  from  our  algorithm,  its  implementation,  the  PETSc  library,  the 
MPI  specification  as  implemented  on  some  selected  platform /communication  network,  contention  with  other 
users  for  system  resources,  and  so  on. 

We  therefore  elected  to  examine  ordering  effects  in  a  more  tightly  controlled  environment.  Problems  were 
generated,  and  local  and  global  orderings  carried  out,  using  our  parallel  code.  Each  subdomain’s  interior 
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2D  240  x  240  grid,  levels  0, 7 


2D  600  x  600  grid,  levels  0, 7 


nodes  were  locally  numbered  using  natural  ordering.  At  this  point  the  matrices  were  written  to  file,  after 
which  convergence  behavior  was  studied  on  a  uni-processor  platform. 

For  each  series  of  runs  we  generated  problems  for  2D  grids  ranging  in  size  from  128  x  128  to  600  x  600 
grid-points,  and  optimally  partitioned  the  grids  amongst  1,  4,  9,  16,  and  25  processors.  Our  findings  are 
summarized  in  Figures  3.2  and  3.3.  The  figures  show  that  the  partitioning  and  subsequent  ordering  does 
affect  convergence  behavior,  but  that  the  effect  is  relatively  minor.  Further,  the  effect  decreases  as  the 
problem  size  increases,  relative  to  the  number  of  processors.  The  fastest  solution  time  on  the  240  x  240  grid 
was  18%  slower  for  25  processors  than  for  a  single  processor,  while  there  was  only  a  3.8%  difference  on  the 
600  x  600  grid. 
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Table  3.6 

Symbolic  and  Numeric  factorization  for  3D  Scaled  problem ,  91,125  unknowns  per  processor ,  7-point  stencil,  ILU(2) 
factorization  on  interior  nodes ,  ILU(l)  factorization  on  boundary  nodes,  timings  in  seconds .  Dashes  (-)  for  Beowulf  and  HPC 
10000  indicate  the  machines  have  insufficient  cpus  to  perform,  the  rims.  Dashes  for  the  T3E  indicate  jobs  that,  for  reasons 
partially  unknown,  did  not  complete  successfully.  The  “ variability ”  line  indicates  whether  repeated  results  were  reasonably 
consistent. 


Procs 

T3E 

(NERSC) 

0rigin2000 

AMES 

Beowulf 

(ICASE) 

HPC  10000 
(ODU) 

1 

- 

2.04 

2.27 

2.13 

8 

2.78 

2.44 

3.11 

2.43 

27 

3.32 

2.96 

4.06 

2.97 

64 

- 

3.11 

4.64 

- 

125 

3.50 

3.18 

- 

- 

216 

- 

3.32 

- 

- 

variability 

low 

high 

low 

high 

3.2.  Parallel  Performance.  In  this  subsection  we  report  timing  and  scalability  results  for  precondi¬ 
tioner  factorization  and  application  on  four  parallel  platforms: 

•  a  Cray  T3E  at  National  Energy  Research  Scientific  Computing  Center  (NERSC,  part  of  the  Com¬ 
puting  Sciences  Directorate  at  Lawrence  Berkeley  National  Laboratory); 

•  an  SGI  0rigin2000  at  NASA  Ames  Research  Center  (AMES); 

•  the  Coral  PC  Beowulf  cluster  at  the  Institute  for  Computer  Applications  in  Science  and  Engineering 
(ICASE); 

•  a  Sun  HPC  10000  Starfire  server  at  Old  Dominion  University  (ODU). 

Table  3.6  shows  factorization  timings  for  a  3D  scaled  problem  with  approximately  91,125  unknowns  per 
processor.  Reading  down  any  of  the  columns  shows  that  performance  is  highly  scalable,  e.g.,  for  the  SGI 
0rigin2000  factorization  for  216  processors  and  19.7  million  unknowns  required  only  62%  longer  than  the 
serial  case.  Scanning  horizontally  indicates  that  performance  was  similar  across  all  platforms,  e.g.,  execution 
time  differed  by  less  than  a  factor  of  two  between  the  fastest  (0rigin2000)  and  slowest  (Beowulf)  platforms. 

The  last  row  in  the  table,  “variability,”  summarizes  our  experience  that  timings  for  identical  repeated 
runs  on  the  HPC  10000  and  SGI  typically  varied  by  50%  or  more,  while  repeated  runs  on  the  T3E  and 
Beowulf  were  remar  ably  consistent.  Other  researchers  have  noted  similar  phenomena.  On  the  T3E  the 
variability  is  due  to  contention  with  other  users  for  common  resources.  On  the  HPC  10000,  where  we  had 
essentially  exclusive  use  of  all  resources,  the  high  degree  of  variability  is  attributed  contention  on  the  shared 
memory  bus. 

Table  3.7  shows  similar  data  and  trends  for  the  triangular  solves  for  the  scaled  problem.  Scalability  for 
the  solves  was  not  quite  as  good  as  for  factorization;  e.g.,  the  solve  with  216  processors  took  about  2  1/2 
times  (250%)  longer  than  the  serial  case. 

Table  3.8  shows  speedup  for  a  constant-sized  problem  of  1.7M  unknowns.  As  expected,  there  is  a  clear 
correlation  between  performance  and  subdomain  interior /boundary  node  ratios:  as  the  ratio  drops,  so  does 
performance. 

We  remind  the  reader  that,  for  ILU(k),  the  structure  of  the  symbolic  factor  depends  on  the  grid  and 
stencil  only.  Hence,  the  performances  reported  in  these  tables  is  applicable  to  any  PDE  that  has  been 
discretized  with  a  7-point  central  difference  stencil. 
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Table  3.7 

Triangular  solves  for  3D  Scaled  problem ,  91,125  unknowns  per  processor,  7-point  stencil,  ILU(2)  factorization  on  interior 
nodes,  ILU(l)  factorization  on  boundary  nodes,  timings  in  seconds.  Dashes  (-)  for  Beowulf  and  HPC  10000  indicate  the 
machines  have  insufficient  cpus  to  perform,  the  runs.  Dashes  for  the  T3E  indicate  jobs  that,  for  reasons  partially  unknown, 
did  not  complete  successfully.  The  “ variability ”  line  indicates  whether  repeated  results  were  reasonably  consistent. 


Procs 

T3E 

(NERSC) 

0rigin2000 

AMES 

Beowulf 

(ICASE) 

HPC  10000 
(ODU) 

1 

- 

.182 

.187 

.289 

8 

.487 

.431 

.359 

.515 

27 

.623 

.405 

.508 

.629 

64 

- 

.472 

.556 

- 

125 

- 

.610 

- 

- 

216 

- 

.646 

- 

- 

variability 

low 

high 

low 

high 

Table  3.8 

Speedup  for  3D  constant-size  problem ;  grid  was  120  x  120  X  120,  for  a  total  of  1.7M  unknowns;  data  is  for  ILU(O) 
factorization  performed  on  the  SGI  OriginSOOO;  “I/B”  is  the  ratio  of  interior  to  boundary  nodes  in  each  subdomain. 


Procs 

Efficiency 

(%) 

I/B 

Ratio 

Time 

(seconds) 

Unkowns/ 

Processor 

8 

100 

9.3 

2.000 

216,000 

27 

95 

6.0 

.846 

64,000 

64 

66 

4.3 

.408 

27,000 

125 

53 

3.4 

.307 

13,824 

4.  Performance  Analysis.  In  this  section  we  present  simplified  theoretical  analyses  of  algorithmic 
behavior  for  matrices  arising  from  PDEs  discretized  on  2D  grids  with  five-point  stencils  and  3D  grids  with 
seven-point  stencils.  Since  our  arguments  are  structural  in  nature,  we  assume  ILU(k)  is  the  factorization 
method  of  choice.  After  a  word  about  nomenclature,  we  begin  with  the  2D  case. 

The  word  grid  refers  to  the  grid  (mesh)  of  unknowns;  for  Regular  2D  and  3D  grids  with  five  and  seven 
point  stencils,  respectively,  this  is  identical  to  the  undirected  graph,  G{A).  We  remind  the  reader  that  we 
use  the  terms  eliminating  a  node  and  factoring  a  row  synonymously. 

We  assume  the  grid  has  been  block-partitioned,  with  each  subdomain  consisting  of  a  square  subgrid  of 
dimension  cxc.  We  also  assume  the  subdomain  grid  has  dimensions  p  x  p ,  so  there  are  p  total  processors. 
There  are  thus  N  =  c2p  nodes  in  the  grid,  and  subdomains  have  at  most  4c  =  4 boundary  nodes. 

If  subdomain  interior  nodes  are  locally  numbered  in  natural  order  and  fc<c,  matrix  rows  in  the  factor 
F  can  asymptotically  be  considered  as  having  2k  (strict)  upper  triangular  and  2k  (strict)  lower  triangular 
nonzero  entries.  The  justification  for  this  statement  arises  from  consideration  of  Theorem  1;  the  geometric 
intuition  is  illustrated  in  Figure  4.1. 

Assuming  the  classic  ILU(k)  algorithm  is  used  for  symbolic  factorization,  both  symbolic  and  numeric 
factorization  of  row  j  entails  4/c2  work.  This  is  because,  for  each  lower  triangular  entry  in  matrix  row  j , 
factorization  requires  “touching”  each  upper  triangular  entry  in  row  ?'. 

A  red-black  ordering  of  the  subdomain  grid  gives  an  optimal  bipartite  division.  If  red  subdomains  are 
numbered  before  black  subdomains,  our  algorithm  simplifies  to  the  following  three  stages. 


13 


Fig.  4.1.  Counting  lower  triangular  fill  edges  in  a  naturally  ordered  grid .  From  top  to  bottom,  there  are  two  level  0  edges; 
there  is  one  level  1  edge,  due  to  fill  path  9,  3,  4;  there  is  one  level  2  edge  due  to  fill  path  9,  3,  4)  $>'  there  are  two  level  3  edges, 
due  to  fill  paths  9,  3,  4>  5,  6  and  9,  3,  2,  1,  7.  Two  additional  fill  edges  are  created  for  every  level  greater  than  three.  Since 
there  is  a  total  of  six  fill  edges  for  a  level  3  factorization,  we  conclude  that  asymptotically  there  are  2k  lower  triangular  edges 
in  a  level  k  factorization.  By  symmetry,  the  upper  triangle  must  contain  the  sam,e  number  of  entries. 


1.  Red  processors  eliminate  all  nodes;  black  processors  eliminate  interior  nodes. 

2.  Red  processors  send  boundary-row  structure  and  values  to  black  processors. 

3.  Black  processors  eliminate  boundary  nodes. 

If  these  stages  are  non-overlapping,  the  cost  of  the  first  stage  reduces  to  the  cost  of  eliminating  all  nodes 
in  a  subdomain.  This  cost  is  4 k1 2 3c2  = 

v 

The  cost  for  the  second  stage  is  the  cost  of  sending  structural  and  numerical  values  from  the  upper- 
triangular  portions  of  the  boundary  rows  to  neighboring  processors.  If  k  <C  c,  Theorem  1  can  be  used  to 
show  that,  asymptotically,  a  processor  only  needs  to  forward  values  from  c  rows  to  each  neighbor.  Assuming 
a  standard,  non-contentious  communication  model  wherein  a  and  (5  represent  message  startup  and  cost-per- 
word  respectively,  and  with  time  for  each  operation  normalized  to  unity,  the  cost  for  this  step  is  4(a-f-2fc/3c)  = 
4(a  +  2fc/3v/f). 

Since  the  cost  of  factoring  a  boundary  row  can  be  shown  to  be  asymptotically  identical  to  that  for 
factoring  an  interior  row,  the  cost  for  eliminating  the  4c  boundary  nodes  is  (4/c2)(4c)  =  16/c2y^. 

Speedup  can  then  be  expressed  as 
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speedup  = 


4  k2N 


^lK  +  4(a  +  2  kfijf)  +  16pyf  ’ 

where  the  numerator  represents  cost  for  uni-processor  (sequential)  execution,  and  the  three  denominator 
terms  represent  the  costs  for  the  corresponding  stages  of  the  simplified  algorithm  for  parallel  execution. 

Two  interpretations  of  this  equation  are  in  order.  First,  for  a  fixed  problem  size  and  number  of  processors, 
the  parallel  computational  cost  (the  first  and  third  terms  in  the  denominator)  is  proportional  to  k 2,  while 
the  communication  cost  (the  second  term  in  the  denominator)  is  proportional  to  k .  This  explains  the 
experimentally  observed  increase  in  efficiency  with  level.  Second,  if  the  ratio  N/p  is  large  enough,  the  first 
term  in  the  denominator  will  dominate  the  second  and  third  terms,  and  efficiency  will  approach  100%. 

For  the  3D  case  we  assume  partitioning  into  cubic  subgrids  of  dimension  cxcxc  and  a  subdomain  grid 
of  dimension  p1/3  x  p1/3  x  p1/3,  which  gives  N  =  c3p.  Subdomains  have  at  most  6c2  boundary  nodes.  A 
development  similar  to  that  above  shows  that,  asymptotically,  matrix  rows  in  the  factor  F  have  k 2  (strict) 
upper  and  lower  triangular  entries,  so  the  cost  for  factoring  a  row  is  A;4.  Speedup  for  this  case  can  then  be 
expressed  as 

k4N 

speedup  -  +  6(a  +  ^(^1/3)  +  6Jfc4(£)i/3  * 
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